This problem of predicting a client's loan-repayment ability comes from a Kaggle competition hosted in 2018 by the Home Credit Group. Home Credit is an international consumer finance provider founded in 1997, with operations in nine countries. It focuses on responsible lending, primarily to people with little or no credit history, and offers point-of-sale (POS) loans, cash loans and revolving loan products. Home Credit's aim is to provide innovative retail financial services with a focus on mass-retail lending, encouraging economic development by supporting domestic consumption and thereby improving living standards.
Many people struggle to get loans due to insufficient or non-existent credit histories, and this population is often exploited by untrustworthy lenders. Home Credit assesses the repayment ability of this unbanked population using a variety of data, including telco and transactional information. Doing so ensures that clients capable of repayment are not rejected and that loans are granted responsibly.
In this case study we use the data provided by Home Credit to predict a client's repayment capability. The data files are as follows:
| File Name | Description | Number of features |
|---|---|---|
| application_train.csv | Information about the loan and the applicant at application time | 121 |
| bureau.csv | Data on the client's previous loans from other institutions, as reported to the Credit Bureau | 17 |
| bureau_balance.csv | Monthly balances of credits in the Credit Bureau | 3 |
| previous_application.csv | Information about the client's previous loan applications and the client at that time | 37 |
| POS_CASH_balance.csv | Monthly balances of the client's previous loans with Home Credit | 8 |
| installments_payments.csv | Payment history for the client's previous loans | 8 |
| credit_card_balance.csv | Monthly balances of the client's previous credit card loans | 23 |
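The tables above form a relational schema keyed on applicant and loan IDs. A minimal sketch, using tiny invented frames (the row values are made up for illustration), of how `SK_ID_CURR` links `application_train.csv` to a many-rows-per-client table such as `bureau.csv`:

```python
import pandas as pd

# Hypothetical mini-frames mirroring the join keys of the real tables:
# SK_ID_CURR links application_train.csv to bureau.csv (rows invented).
app = pd.DataFrame({"SK_ID_CURR": [100002, 100003], "TARGET": [1, 0]})
bureau = pd.DataFrame({"SK_ID_CURR": [100002, 100002, 100003],
                       "SK_ID_BUREAU": [1, 2, 3],
                       "AMT_CREDIT_SUM": [50000.0, 20000.0, 90000.0]})

# Aggregate the many bureau rows per applicant, then merge back one-to-one
agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].sum()
       .rename("BUREAU_CREDIT_TOTAL").reset_index())
merged = app.merge(agg, on="SK_ID_CURR", how="left")
print(merged)
```

The same aggregate-then-merge pattern applies to the other child tables (`previous_application.csv`, `installments_payments.csv`, etc.).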
# Importing the required libraries and modules
import os
import zipfile
! pip install phik
import phik
from phik.report import plot_correlation_matrix
from phik import report
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
import warnings
warnings.filterwarnings("ignore")
plt.rcParams.update({'font.size': 14})
# Libraries for kmeans
from sklearn.cluster import KMeans
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting phik
Successfully installed phik-0.12.2
# step 1
"""
To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com.
Then go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'.
This will trigger the download of kaggle.json, a file containing your API credentials.
"""
# step 2
!pip install -q kaggle
# step 3
from google.colab import files
files.upload()
Saving kaggle.json to kaggle.json
{'kaggle.json': b'{"username":"swapnilransing","key":"<REDACTED>"}'}
# step 4 : creating a directory
!mkdir ~/.kaggle # !rm -rf ~/.kaggle is the command if the directory is already present
# step 5 :copy the file kaggle.json to created directory
!cp kaggle.json ~/.kaggle/
# step 6 : restrict permissions on kaggle.json (required by the Kaggle API)
! chmod 600 ~/.kaggle/kaggle.json
# step 7 : downloading the particular files
! kaggle competitions download -c home-credit-default-risk
Downloading home-credit-default-risk.zip to /content
100% 688M/688M [00:07<00:00, 102MB/s]
# step 8 : unzipping
source_dir = '/content/'
os.chdir(source_dir)
file_names = os.listdir(source_dir)
for item in file_names:
    if item.endswith(".zip"):
        file_name = os.path.abspath(item)     # get full path of the file
        zip_ref = zipfile.ZipFile(file_name)  # create ZipFile object
        zip_ref.extractall(source_dir)        # extract into the directory
        zip_ref.close()                       # close the archive
        os.remove(file_name)                  # delete the zip file
# Function for plotting category variable
def category_var_plot(dataframe, category_var):
    """ This function plots a categorical variable of the dataframe.
    Inputs:
        dataframe = name of the dataframe
        category_var = name of the categorical variable
    Outputs:
        1. Number of unique categories of the variable
        2. NaN sample percentage
        3. Percentage count plot of the variable's categories
        4. Binary-classification percentage plot for each category
        5. Pie chart of the defaulter percentage per category
    """
    # 1. Computing the unique categories of the variable
    print("Number of unique category values of {} categorical variable are {}".format(category_var, dataframe[category_var].nunique()))
    nunique = dataframe[category_var].nunique()
    print('-'*185)
    # 2. Computing the NaN sample percentage
    list_unique_cat = dataframe[category_var].unique().tolist()
    counter = 0
    for i in list_unique_cat:
        if str(i) == 'nan':
            print("Dataframe has NaN rows.")
            df = dataframe[dataframe[category_var].isna()]
            print("Percentage of NaN rows for {} categorical variable are {} %".format(category_var, round(df["SK_ID_CURR"].count()/dataframe["SK_ID_CURR"].count()*100, 2)))
            print("Percentage of {} labels for NaN value row is {} %".format(df["TARGET"].value_counts().index.tolist()[0], round(df["TARGET"].value_counts()[0]/df["SK_ID_CURR"].count()*100, 2)))
            print("Percentage of {} labels for NaN value row is {} %".format(df["TARGET"].value_counts().index.tolist()[1], round(df["TARGET"].value_counts()[1]/df["SK_ID_CURR"].count()*100, 2)))
            print("Applicants labelled 1 are defaulters.")
            counter = 1
            break
    if counter == 0:
        print("Dataframe does not have NaN rows.")
    print('-'*185)
    # 3. Percentage count plot of variable categories
    df = dataframe.groupby(category_var)["SK_ID_CURR"].count()
    df = df.rename('percent').reset_index()
    df['percent'] = (df['percent'] / df['percent'].sum()) * 100
    ylim = df['percent'].max() + 5
    g = sns.catplot(x=category_var, y='percent', kind='bar', data=df)
    g.ax.set_ylim(0, ylim)
    g.fig.set_figheight(4.5)
    g.fig.set_figwidth(29)
    g.set_xticklabels(rotation=90)
    for p in g.ax.patches:
        txt = str(p.get_height().round(1)) + '%'
        txt_x = p.get_x()       # x-axis position of the annotation
        txt_y = p.get_height()  # bar height (y position of the annotation)
        g.ax.text(txt_x, txt_y + 0.5, txt)
    plt.title("Percentage distribution of each category in {} variable".format(category_var))
    plt.ylabel("Percentage (%)")
    plt.xlabel(category_var)
    plt.show()
    print('-'*185)
    # 4. Binary classification percentage plot for each category of the categorical variable
    df = dataframe.groupby(category_var)["TARGET"].value_counts(normalize=True)
    df = df.mul(100)
    df = df.rename('percent').reset_index()
    df_non_defaulters = df[df["TARGET"] == 0]
    df_defaulters = df.groupby(category_var)["TARGET"].sum()
    df_defaulters = df_defaulters.mul(100)
    df_defaulters = df_defaulters.rename('percent').reset_index()
    # set the figure size
    plt.figure(figsize=(24, 4.5))
    # bar chart 1 -> top bars (group of 'Defaulters')
    bar1 = sns.barplot(x=category_var, y="percent", data=df_defaulters, color='darkblue')
    # bar chart 2 -> bottom bars (group of 'Non-Defaulters')
    bar2 = sns.barplot(x=category_var, y="percent", data=df_non_defaulters, color='lightblue')
    # add legend
    top_bar = mpatches.Patch(color='darkblue', label='Defaulters')
    bottom_bar = mpatches.Patch(color='lightblue', label='Non-Defaulters')
    plt.legend(handles=[top_bar, bottom_bar])
    # show the graph
    plt.title("Binary classification percentage of each category for {} categorical variable".format(category_var))
    plt.ylabel("Percentage (%)")
    plt.xlabel(category_var)
    plt.ylim(0, 130)
    plt.xticks(rotation=90)
    plt.show()
    print('-'*185)
    # 5. Pie chart of the defaulter percentage per category
    # dataset preparation for the pie chart
    df = dataframe.groupby(category_var)["TARGET"].value_counts()
    df = df.rename('count').reset_index()
    df_defaulters_pie = df[df["TARGET"] == 1]
    # plotting
    plt.figure(figsize=(10, 10))
    # define the seaborn color palette to use
    colors = sns.color_palette('pastel')[0:nunique]
    # create the pie chart
    plt.pie(df_defaulters_pie["count"], labels=df_defaulters_pie[category_var], colors=colors, autopct='%.0f%%')
    plt.title("Defaulters Percentage of Each Category For {} Categorical Variable".format(category_var))
    plt.show()
# Function for plotting continuous variable
def plot_continuous_variables(data, column_name, scale="No", scale_range=(0, 5), box_scale=(0, 5)):
    """ This function plots a continuous variable of the dataframe.
    Inputs:
        data = name of the dataframe
        column_name = name of the continuous variable
        scale = set to "Yes" to apply the ranges below to the output plots
        scale_range = x-axis range for the probability density plot, default (0, 5), used if scale != "No"
        box_scale = y-axis range for the box plot, default (0, 5), used if scale != "No"
    Outputs:
        1. NaN sample percentage
        2. Probability density plot
        3. Box plot
    """
    # 1. Computes the NaN sample percentage
    count = data[data[column_name].isna()].shape[0]
    if count != 0:
        print("Continuous variable {} has {} NaN rows.".format(column_name, count))
        df = data[data[column_name].isna()]
        print("Percentage of NaN rows for {} continuous variable are {} %".format(column_name, round(df["SK_ID_CURR"].count()/data["SK_ID_CURR"].count()*100, 2)))
        if df["TARGET"].value_counts().shape[0] > 1:
            print("Percentage of {} labels for NaN value row is {} %".format(df["TARGET"].value_counts().index.tolist()[0], round(df["TARGET"].value_counts()[0]/df["SK_ID_CURR"].count()*100, 2)))
            print("Percentage of {} labels for NaN value row is {} %".format(df["TARGET"].value_counts().index.tolist()[1], round(df["TARGET"].value_counts()[1]/df["SK_ID_CURR"].count()*100, 2)))
            print("Applicants labelled 1 are defaulters.")
        else:
            print("All the labels of NaN rows are ", df["TARGET"].value_counts().index.tolist()[0])
    else:
        print("Continuous variable {} does not have NaN rows.".format(column_name))
    print('-'*185)
    sns.set_style('whitegrid')
    plt.figure(figsize=(20, 8))
    # 2. Probability density plot
    plt.subplot(1, 2, 1)
    plt.subplots_adjust(wspace=0.25)
    sns.distplot(data[column_name][data['TARGET'] == 0].dropna(), label='Non-Defaulters', hist=False, color='red')
    sns.distplot(data[column_name][data['TARGET'] == 1].dropna(), label='Defaulters', hist=False, color='black')
    plt.xlabel(column_name)
    plt.ylabel('Probability Density')
    plt.legend(fontsize='medium')
    plt.title("Dist-Plot of {}".format(column_name))
    if scale != "No":
        plt.xlim(scale_range)
    # 3. Box plot
    plt.subplot(1, 2, 2)
    plt.subplots_adjust(wspace=0.25)
    g = sns.boxplot(x='TARGET', y=column_name, data=data)
    plt.title("Box-Plot of {}".format(column_name))
    g.set_xticklabels(['Non-Defaulters', 'Defaulters'])
    if scale != "No":
        plt.ylim(box_scale)
    plt.show()
# Function for printing the percentile values of continuous variable
def print_percentiles(data, column_name, percentiles=None):
    '''
    Function to print percentile values for a given column.
    Inputs:
        data = DataFrame name
        column_name = column whose percentiles are to be printed
        percentiles: list, default = None; the list of percentiles to print. If not given, defaults are used.
    Outputs:
        1. Prints the default or specified percentile values
    '''
    print('-'*185)
    if not percentiles:
        percentiles = list(range(0, 80, 25)) + list(range(90, 101, 2))
    for i in percentiles:
        print(f'The {i}th percentile value of {column_name} is {np.percentile(data[column_name].dropna(), i)}')
    print("-"*185)
# Function to plot bar plot of percentage NaN values in a dataframe
def plot_nan_pct(dataframe, title):
    '''
    Plots a bar chart of the percentage of NaN values in a dataframe.
    The function first finds all variables/columns containing NaN values,
    then computes and plots the corresponding percentages.
    Inputs:
        dataframe = DataFrame name
        title = dataframe name as a string (used in the bar plot title)
    Outputs:
        1. Bar plot of the percentage of NaN values in the dataframe
    '''
    nan_col_name = []
    nan_val_count = []
    nan_value_dict = {}
    # preparing a dictionary of columns and corresponding NaN percentage
    for i in range(dataframe.shape[1]):
        count = round(dataframe[dataframe[dataframe.columns[i]].isna()].shape[0]/dataframe.shape[0]*100, 2)
        if count != 0:
            nan_value_dict[dataframe.columns[i]] = count
    # sorting the dict in descending order and storing the column names and NaN percentages in lists
    for w in sorted(nan_value_dict, key=nan_value_dict.get, reverse=True):
        nan_val_count.append(nan_value_dict[w])
        nan_col_name.append(w)
    if len(nan_val_count) > 0:
        print("Number of variables having NaN samples are ", len(nan_col_name))
        # generating the plot
        fig = plt.figure(figsize=(25, 5))
        # creating the bar plot
        plt.bar(nan_col_name, nan_val_count, color='maroon')
        plt.xlabel("Variable Name")
        plt.ylabel("Percentage (%)")
        plt.title("Percentage of NaN values in " + title)
        plt.xticks(rotation=90)
        plt.show()
    else:
        print("Dataframe {} does not have any NaN variable".format(title))
# Function for pair plot of a dataframe's top continuous variables
def pairPlotFn(dataframe, NameString):
    g = sns.pairplot(dataframe, hue="TARGET", markers=["o", "s"], height=4)
    plt.suptitle("Pair Plot of {} Dataframe's Top Continuous Variables".format(NameString), size=22, y=1.02)
    plt.show()
For the data analysis, we followed the steps below, starting with the column descriptions provided by Home Credit:
columns=pd.read_csv("HomeCredit_columns_description.csv",encoding='latin1')
columns
| | Unnamed: 0 | Table | Row | Description | Special |
|---|---|---|---|---|---|
| 0 | 1 | application_{train|test}.csv | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | 2 | application_{train|test}.csv | TARGET | Target variable (1 - client with payment diffi... | NaN |
| 2 | 5 | application_{train|test}.csv | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 3 | 6 | application_{train|test}.csv | CODE_GENDER | Gender of the client | NaN |
| 4 | 7 | application_{train|test}.csv | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
| ... | ... | ... | ... | ... | ... |
| 214 | 217 | installments_payments.csv | NUM_INSTALMENT_NUMBER | On which installment we observe payment | NaN |
| 215 | 218 | installments_payments.csv | DAYS_INSTALMENT | When the installment of previous credit was su... | time only relative to the application |
| 216 | 219 | installments_payments.csv | DAYS_ENTRY_PAYMENT | When was the installments of previous credit p... | time only relative to the application |
| 217 | 220 | installments_payments.csv | AMT_INSTALMENT | What was the prescribed installment amount of ... | NaN |
| 218 | 221 | installments_payments.csv | AMT_PAYMENT | What the client actually paid on previous cred... | NaN |
219 rows × 5 columns
pd.options.display.max_colwidth = 200
print(columns["Description"][columns["Row"]=='TARGET'])
1    Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)
Name: Description, dtype: object
HomeCredit_columns_description.csv contains the descriptions of all columns across all the tables.
# Pie chart for Target Variable
application_train=pd.read_csv("application_train.csv")
y = np.array([application_train["TARGET"].value_counts()[0], application_train["TARGET"].value_counts()[1]])
mylabels = ["Non Defaulters", "Defaulters"]
plt.figure(figsize=(8,8))
plt.pie(y, labels = mylabels, startangle = 90,autopct='%.0f%%')
plt.show()
Observations and Conclusions:
application_train=pd.read_csv("application_train.csv")
application_train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | ... | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
307511 rows × 122 columns
print("Shape of Application Train dataframe is ",application_train.shape)
print("Number of unique values of SK_ID_CURR in Application Train dataframe are ",application_train["SK_ID_CURR"].nunique())
print("Target labels and corresponding counts are \n",application_train["TARGET"].value_counts())
print("Percentage of non defaulted loans are {}%".format(round(application_train["TARGET"].value_counts()[0]/application_train["SK_ID_CURR"].nunique()*100,2)))
print("Percentage of defaulted loans are {}%".format(round(application_train["TARGET"].value_counts()[1]/application_train["SK_ID_CURR"].nunique()*100,2)))
Shape of Application Train dataframe is (307511, 122)
Number of unique values of SK_ID_CURR in Application Train dataframe are 307511
Target labels and corresponding counts are
0    282686
1     24825
Name: TARGET, dtype: int64
Percentage of non defaulted loans are 91.93%
Percentage of defaulted loans are 8.07%
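The same class balance can be read off in one call with `value_counts(normalize=True)`, which returns proportions directly and avoids the manual division. A minimal sketch on invented labels that mimic the ~92/8 split reported above:

```python
import pandas as pd

# Toy target labels with the same imbalance as application_train (92% vs 8%)
target = pd.Series([0] * 92 + [1] * 8)
pct = (target.value_counts(normalize=True) * 100).round(2)
print(pct)
```

On the real dataframe this would be `application_train["TARGET"].value_counts(normalize=True)`.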
#NaN variable percentatge in dataframe
plot_nan_pct(application_train,"Application Train")
Number of variables having NaN samples are 64
Observations and Conclusions:
application_test=pd.read_csv("application_test.csv")
application_test
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | Cash loans | F | N | Y | 0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 48740 | 456222 | Cash loans | F | N | N | 2 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 48741 | 456223 | Cash loans | F | Y | Y | 1 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 |
| 48742 | 456224 | Cash loans | M | N | N | 0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 48743 | 456250 | Cash loans | F | Y | N | 0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
48744 rows × 121 columns
print("Common IDs between application_train.csv and application_test.csv: ",len(list(set(application_train["SK_ID_CURR"]) & set(application_test["SK_ID_CURR"]))))
Common IDs between application_train.csv and application_test.csv:  0
Observations and Conclusions:
Let's understand the correlation between the application_train.csv columns and the target label. For numerical variables we will use the Pearson correlation coefficient, and for categorical variables we will use the phi-k correlation coefficient.
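A toy illustration, on invented data, of why Pearson alone is not enough: Pearson measures only linear association, so a perfectly predictable but nonlinear (here quadratic) relation scores near zero, which is exactly the gap phi-k is designed to fill for categorical and nonlinear variables.

```python
import numpy as np
import pandas as pd

# Symmetric x grid: Pearson sees the linear column perfectly,
# but the quadratic column correlates ~0 despite full dependence
x = np.linspace(-3, 3, 201)
toy = pd.DataFrame({"x": x, "linear": 2 * x, "quadratic": x ** 2})
corr = toy.corr()  # Pearson by default
print(corr.round(3))
```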
# Getting the categorical data column names
# As number of data columns are high in application_train dataframe, we will plot the number of unique values
# in each dataframe and use threshold from the observations to seperate the categorical and continuous variables
application_train_columns=list(application_train.columns)
application_train_columns_nunique=[]
for i in application_train_columns:
    application_train_columns_nunique.append(application_train[i].nunique())
# bar plot of application train columns and corresponding unique values
fig = plt.figure(figsize = (50, 10))
# creating the bar plot
plt.bar(application_train_columns, application_train_columns_nunique, color ='maroon',
width = 0.4)
plt.xlabel("Column Names")
plt.ylabel("Value")
plt.title("Unique values of application train columns")
plt.ylim(0,200)
plt.xticks(rotation = 90)
plt.show()
# using a threshold value of unique values per column to categorize the column as categorical column
application_train_categorical_col=[]
for i in range(len(application_train_columns_nunique)):
    if application_train_columns_nunique[i] < 100:  # 100 is the threshold value used
        application_train_categorical_col.append(application_train_columns[i])
print("Total number of categorical variables in the application columns are ",len(application_train_categorical_col))
print('These columns are ',application_train_categorical_col)
Total number of categorical variables in the application columns are  74
These columns are  ['TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
Observations and Conclusions:
As the number of columns is large, it is difficult to read the column names on the plot. However, it is clear that columns with fewer than 100 unique values can be treated as categorical.
After manually checking the values of each column, the following turn out not to be categorical: 'REGION_POPULATION_RELATIVE', 'OWN_CAR_AGE', 'HOUR_APPR_PROCESS_START', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI'.
Let's refine the list of categorical variables and use the correlation matrix to understand their relation to the target variable.
# Selecting the categorical columns
not_cat_col=[ 'REGION_POPULATION_RELATIVE', 'OWN_CAR_AGE', 'HOUR_APPR_PROCESS_START', 'ELEVATORS_MODE', \
'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI']
application_train_categorical_col=[x for x in application_train_categorical_col if not x in not_cat_col]
print("Total number of categorical variables in the application columns are ",len(application_train_categorical_col))
print('These columns are ',application_train_categorical_col)
# Categorical Application Train dataframe
application_train_cat=application_train[application_train_categorical_col]
Total number of categorical variables in the application columns are  63
These columns are  ['TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
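As an aside, the nunique-threshold heuristic used above can be written without an explicit Python loop: `DataFrame.nunique()` gives the per-column unique counts as a Series, which can then be filtered with a boolean mask. A minimal sketch on an invented toy frame, with a threshold of 3 standing in for the 100 used on the real data:

```python
import pandas as pd

# Toy frame: 'flag' (2 uniques) and 'city' (3) fall under the threshold,
# 'income' (4) does not
toy = pd.DataFrame({"flag": [0, 1, 0, 1],
                    "city": ["a", "b", "a", "c"],
                    "income": [100.5, 200.0, 150.25, 300.75]})
nunique = toy.nunique()
categorical_cols = nunique[nunique <= 3].index.tolist()
print(categorical_cols)
```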
# Computing phi-k correlation for categorical variables
phik_overview = application_train_cat.phik_matrix()
phik_overview.round(2)
# Plotting phi-k correlation
mask_array = np.ones(phik_overview.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(20, 20))
cmap = sns.color_palette("Blues")
sns.heatmap(phik_overview,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title("Correlation $\phi_K$ for application train categorical variables")
plt.show()
print("-"*185)
application_top_categorical_var_lst=phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between categorical variable and Target variable sorted in descending order are :\n")
phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
interval columns not set, guessing: ['TARGET', 'CNT_CHILDREN', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Correlation between categorical variable and Target variable sorted in descending order are :
TARGET                     1.000000
OCCUPATION_TYPE            0.102846
ORGANIZATION_TYPE          0.089164
NAME_INCOME_TYPE           0.084831
REG_CITY_NOT_WORK_CITY     0.079946
FLAG_EMP_PHONE             0.072087
REG_CITY_NOT_LIVE_CITY     0.069588
FLAG_DOCUMENT_3            0.069525
NAME_FAMILY_STATUS         0.056043
NAME_HOUSING_TYPE          0.051107
LIVE_CITY_NOT_WORK_CITY    0.050956
Name: TARGET, dtype: float64
# Let's similarly observe the correlation between the continuous variables and the Target variable using the Pearson correlation coefficient
application_train_categorical_col.remove('TARGET')
application_train_continuous=application_train.drop(application_train_categorical_col,axis=1)
application_train_continuous=application_train_continuous.drop(["SK_ID_CURR"],axis=1)
# correlation and plotting
corr=application_train_continuous.corr()
mask_array = np.ones(corr.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(25, 20))
cmap = sns.color_palette("Blues")
sns.heatmap(corr,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title("Correlation Pearson for application train continuous variables")
plt.show()
print("-"*185)
application_top_continuous_var_lst=corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between continuous variable and Target variable sorted in descending order are :\n")
corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Correlation between continuous variable and Target variable sorted in descending order are :
TARGET 1.000000 EXT_SOURCE_3 0.178919 EXT_SOURCE_2 0.160472 EXT_SOURCE_1 0.155317 DAYS_BIRTH 0.078239 DAYS_LAST_PHONE_CHANGE 0.055218 DAYS_ID_PUBLISH 0.051457 DAYS_EMPLOYED 0.044932 FLOORSMAX_AVG 0.044003 FLOORSMAX_MEDI 0.043768 FLOORSMAX_MODE 0.043226 Name: TARGET, dtype: float64
Observations and Conclusions:
We will analyze plots of the top 10 categorical features of the Application Train dataframe that are most highly correlated with the target variable.
# Occupation type
category_var_plot(application_train,application_top_categorical_var_lst[1])
Number of unique category values of OCCUPATION_TYPE categorical variable are 18 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for OCCUPATION_TYPE categorical variable are 31.35 % Percentage of 0 labels for NaN value row is 93.49 % Percentage of 1 labels for NaN value row is 6.51 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# ORGANIZATION_TYPE
category_var_plot(application_train,application_top_categorical_var_lst[2])
Number of unique category values of ORGANIZATION_TYPE categorical variable are 58 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# NAME_INCOME_TYPE
category_var_plot(application_train,application_top_categorical_var_lst[3])
Number of unique category values of NAME_INCOME_TYPE categorical variable are 8 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# REG_CITY_NOT_WORK_CITY
category_var_plot(application_train,application_top_categorical_var_lst[4])
# REG_CITY_NOT_LIVE_CITY
category_var_plot(application_train,application_top_categorical_var_lst[6])
# LIVE_CITY_NOT_WORK_CITY
category_var_plot(application_train,application_top_categorical_var_lst[10])
Number of unique category values of REG_CITY_NOT_WORK_CITY categorical variable are 2 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of unique category values of REG_CITY_NOT_LIVE_CITY categorical variable are 2 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of unique category values of LIVE_CITY_NOT_WORK_CITY categorical variable are 2 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# FLAG_EMP_PHONE
category_var_plot(application_train,application_top_categorical_var_lst[5])
Number of unique category values of FLAG_EMP_PHONE categorical variable are 2 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# FLAG_DOCUMENT_3
category_var_plot(application_train,application_top_categorical_var_lst[7])
Number of unique category values of FLAG_DOCUMENT_3 categorical variable are 2 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# NAME_FAMILY_STATUS
category_var_plot(application_train,application_top_categorical_var_lst[8])
Number of unique category values of NAME_FAMILY_STATUS categorical variable are 6 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# NAME_HOUSING_TYPE
category_var_plot(application_train,application_top_categorical_var_lst[9])
Number of unique category values of NAME_HOUSING_TYPE categorical variable are 6 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# EXT_SOURCE_3
plot_continuous_variables(application_train,application_top_continuous_var_lst[1])
Continuous variable EXT_SOURCE_3 has 60965 NaN rows. Percentage of NaN rows for EXT_SOURCE_3 continuous variable are 19.83 % Percentage of 0 labels for NaN value row is 90.69 % Percentage of 1 labels for NaN value row is 9.31 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
# EXT_SOURCE_2
plot_continuous_variables(application_train,application_top_continuous_var_lst[2])
Continuous variable EXT_SOURCE_2 has 660 NaN rows. Percentage of NaN rows for EXT_SOURCE_2 continuous variable are 0.21 % Percentage of 0 labels for NaN value row is 92.12 % Percentage of 1 labels for NaN value row is 7.88 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
# EXT_SOURCE_1
plot_continuous_variables(application_train,application_top_continuous_var_lst[3])
Continuous variable EXT_SOURCE_1 has 173378 NaN rows. Percentage of NaN rows for EXT_SOURCE_1 continuous variable are 56.38 % Percentage of 0 labels for NaN value row is 91.48 % Percentage of 1 labels for NaN value row is 8.52 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable represents the client's age in days at the time of application. These values are relative to the application date and are therefore negative.
Client's age in days = birth date - application date
For interpretation of results we will convert the client's age to positive year values.
# DAYS_BIRTH converted to age_birth
application_train["age_birth"]=application_train[application_top_continuous_var_lst[4]].abs()/365
plot_continuous_variables(application_train,"age_birth")
application_train=application_train.drop(["age_birth"],axis=1)
Continuous variable age_birth does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable tells how many days before the application the client last changed their phone.
DAYS_LAST_PHONE_CHANGE = phone change date - application date
For interpretation of results we will convert DAYS_LAST_PHONE_CHANGE to positive year values.
# DAYS_LAST_PHONE_CHANGE to years last phone change
application_train["Year_LAST_PHONE_CHANGE"]=application_train[application_top_continuous_var_lst[5]].abs()/365
plot_continuous_variables(application_train,"Year_LAST_PHONE_CHANGE")
application_train=application_train.drop(["Year_LAST_PHONE_CHANGE"],axis=1)
Continuous variable Year_LAST_PHONE_CHANGE has 1 NaN rows. Percentage of NaN rows for Year_LAST_PHONE_CHANGE continuous variable are 0.0 % All the labels of NaN rows are 0 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable tells how many days before the application the client changed their identity document.
DAYS_ID_PUBLISH = ID change date - application date
For interpretation of results we will convert DAYS_ID_PUBLISH to positive year values.
# DAYS_ID_PUBLISH to Year_ID_PUBLISH
application_train["Year_ID_PUBLISH"]=application_train[application_top_continuous_var_lst[6]].abs()/365
plot_continuous_variables(application_train,"Year_ID_PUBLISH")
application_train=application_train.drop(["Year_ID_PUBLISH"],axis=1)
Continuous variable Year_ID_PUBLISH does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable tells how many days before the application the client started their current employment.
DAYS_EMPLOYED = current employment start date - application date
For interpretation of results we will convert DAYS_EMPLOYED to positive year values.
# DAYS_EMPLOYED to YEARS_EMPLOYED
application_train["YEARS_EMPLOYED"]=application_train[application_top_continuous_var_lst[7]].abs()/365
print_percentiles(application_train,"YEARS_EMPLOYED",percentiles=[0,25,50,75,80,85,90,100])
plot_continuous_variables(application_train,"YEARS_EMPLOYED","Yes",(-200,200),(0,50))
application_train=application_train.drop(["YEARS_EMPLOYED"],axis=1)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- The 0th percentile value of YEARS_EMPLOYED is 0.0 The 25th percentile value of YEARS_EMPLOYED is 2.5561643835616437 The 50th percentile value of YEARS_EMPLOYED is 6.079452054794521 The 75th percentile value of YEARS_EMPLOYED is 15.635616438356164 The 80th percentile value of YEARS_EMPLOYED is 25.172602739726027 The 85th percentile value of YEARS_EMPLOYED is 1000.6657534246575 The 90th percentile value of YEARS_EMPLOYED is 1000.6657534246575 The 100th percentile value of YEARS_EMPLOYED is 1000.6657534246575 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Continuous variable YEARS_EMPLOYED does not have NaN rows. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
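The jump in the percentiles above, from about 25 years at the 80th percentile to about 1000.67 years from the 85th percentile onward, suggests a placeholder value rather than a real tenure: 1000.67 years corresponds to 365243 days. A minimal sketch of flagging and masking such a sentinel on toy data (the 365243 value is inferred from the output above, not stated in the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for application_train's DAYS_EMPLOYED column:
# negative values = days employed before application, 365243 = placeholder.
df = pd.DataFrame({"DAYS_EMPLOYED": [-2329, -1200, 365243, -4000, 365243]})

# Flag the placeholder before converting to years, then replace it with NaN
# so it does not distort percentiles or plots.
df["DAYS_EMPLOYED_ANOM"] = df["DAYS_EMPLOYED"] == 365243
df["YEARS_EMPLOYED"] = df["DAYS_EMPLOYED"].replace(365243, np.nan).abs() / 365

print(df["DAYS_EMPLOYED_ANOM"].sum())  # number of placeholder rows
print(df["YEARS_EMPLOYED"].max())      # now a plausible tenure in years
```

Keeping the flag column preserves the information that the value was missing, which can itself be predictive.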
# FLOORSMAX_AVG, FLOORSMAX_MEDI, FLOORSMAX_MODE
plot_continuous_variables(application_train,application_top_continuous_var_lst[8])
plot_continuous_variables(application_train,application_top_continuous_var_lst[9])
plot_continuous_variables(application_train,application_top_continuous_var_lst[10])
Continuous variable FLOORSMAX_AVG has 153020 NaN rows. Percentage of NaN rows for FLOORSMAX_AVG continuous variable are 49.76 % Percentage of 0 labels for NaN value row is 90.81 % Percentage of 1 labels for NaN value row is 9.19 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Continuous variable FLOORSMAX_MEDI has 153020 NaN rows. Percentage of NaN rows for FLOORSMAX_MEDI continuous variable are 49.76 % Percentage of 0 labels for NaN value row is 90.81 % Percentage of 1 labels for NaN value row is 9.19 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Continuous variable FLOORSMAX_MODE has 153020 NaN rows. Percentage of NaN rows for FLOORSMAX_MODE continuous variable are 49.76 % Percentage of 0 labels for NaN value row is 90.81 % Percentage of 1 labels for NaN value row is 9.19 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions:
df=application_train[["TARGET","EXT_SOURCE_3","EXT_SOURCE_2","EXT_SOURCE_1"]]
pairPlotFn(df,"Application Train")
Observations and Conclusions:
This dataframe consists of clients' previous credits provided by other financial institutions that were reported to the Credit Bureau. For every loan in the sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.
bureau=pd.read_csv("bureau.csv")
bureau
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.00 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.00 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.50 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.00 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.00 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1716423 | 259355 | 5057750 | Active | currency 1 | -44 | 0 | -30.0 | NaN | 0.0 | 0 | 11250.00 | 11250.0 | 0.0 | 0.0 | Microloan | -19 | NaN |
| 1716424 | 100044 | 5057754 | Closed | currency 1 | -2648 | 0 | -2433.0 | -2493.0 | 5476.5 | 0 | 38130.84 | 0.0 | 0.0 | 0.0 | Consumer credit | -2493 | NaN |
| 1716425 | 100044 | 5057762 | Closed | currency 1 | -1809 | 0 | -1628.0 | -970.0 | NaN | 0 | 15570.00 | NaN | NaN | 0.0 | Consumer credit | -967 | NaN |
| 1716426 | 246829 | 5057770 | Closed | currency 1 | -1878 | 0 | -1513.0 | -1513.0 | NaN | 0 | 36000.00 | 0.0 | 0.0 | 0.0 | Consumer credit | -1508 | NaN |
| 1716427 | 246829 | 5057778 | Closed | currency 1 | -463 | 0 | NaN | -387.0 | NaN | 0 | 22500.00 | 0.0 | NaN | 0.0 | Microloan | -387 | NaN |
1716428 rows × 17 columns
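The one-row-per-previous-credit layout above means a client's bureau history is recovered by grouping on SK_ID_CURR. A toy sketch of that counting step (synthetic rows, not the real data):

```python
import pandas as pd

# Synthetic stand-in for bureau.csv: one row per previous Credit Bureau credit.
toy_bureau = pd.DataFrame({
    "SK_ID_CURR":   [215354, 215354, 215354, 100044, 100044, 246829],
    "SK_ID_BUREAU": [5714462, 5714463, 5714464, 5057754, 5057762, 5057770],
})

# Number of previous Credit Bureau credits per applicant.
prev_credit_counts = (toy_bureau.groupby("SK_ID_CURR")["SK_ID_BUREAU"]
                                .count()
                                .rename("BUREAU_CREDIT_COUNT"))
print(prev_credit_counts)
```

Aggregates like this count are a common way to fold bureau rows back to one row per SK_ID_CURR before joining with application_train.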
# Let's find the common IDs between application_train.csv and bureau.csv
com_len_train=len(list(set(application_train["SK_ID_CURR"]) & set(bureau["SK_ID_CURR"])))
print("Common id's between application_train and bureau are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(application_train["SK_ID_CURR"]))*100,2))
Common id's between application_train and bureau are 263491 Corresponding percentage is 85.69
# NaN variable percentage in the dataframe
plot_nan_pct(bureau,"bureau")
Number of variables having NaN samples are 6
Observations and Conclusions:
Let's assign the "TARGET" variable to each bureau sample. We will use the common IDs between application_train and bureau to assign the Target label. We will remove the dataframe samples for which a target label is not assigned.
After merging we will understand the correlation between the target variable and the categorical variables, and between the target variable and the continuous variables.
print("Merging TARGET with bureau dataframe")
bureau_merged = application_train.iloc[:,:2].merge(bureau, on = 'SK_ID_CURR', how = 'left')
print("Percentage of {} labeled rows is {} %".format(bureau_merged["TARGET"].value_counts().index.tolist()[0],round(bureau_merged["TARGET"].value_counts()[0]/bureau_merged.shape[0]*100,2)))
print("Percentage of {} labeled rows is {} %".format(bureau_merged["TARGET"].value_counts().index.tolist()[1],round(bureau_merged["TARGET"].value_counts()[1]/bureau_merged.shape[0]*100,2)))
print("Applicants labelled 1 are defaulters.")
print("-"*100)
bureau_merged.head()
Merging TARGET with bureau dataframe Percentage of 0 labeled rows is 92.12 % Percentage of 1 labeled rows is 7.88 % Applicants labelled 1 are defaulters. ----------------------------------------------------------------------------------------------------
| SK_ID_CURR | TARGET | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 6158904.0 | Closed | currency 1 | -1125.0 | 0.0 | -1038.0 | -1038.0 | NaN | 0.0 | 40761.0 | NaN | NaN | 0.0 | Credit card | -1038.0 | 0.0 |
| 1 | 100002 | 1 | 6158905.0 | Closed | currency 1 | -476.0 | 0.0 | NaN | -48.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | Credit card | -47.0 | NaN |
| 2 | 100002 | 1 | 6158906.0 | Closed | currency 1 | -1437.0 | 0.0 | -1072.0 | -1185.0 | 0.000 | 0.0 | 135000.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -1185.0 | 0.0 |
| 3 | 100002 | 1 | 6158907.0 | Closed | currency 1 | -1121.0 | 0.0 | -911.0 | -911.0 | 3321.000 | 0.0 | 19071.0 | NaN | NaN | 0.0 | Consumer credit | -906.0 | 0.0 |
| 4 | 100002 | 1 | 6158908.0 | Closed | currency 1 | -645.0 | 0.0 | 85.0 | -36.0 | 5043.645 | 0.0 | 120735.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -34.0 | 0.0 |
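The label percentages printed above are computed with two separate value_counts lookups; value_counts(normalize=True) yields both fractions in one call. A sketch on a toy label column (synthetic values, not the merged dataframe):

```python
import pandas as pd

# Toy TARGET column: 0 = repaid, 1 = defaulted.
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Fractions of each label in one call; multiply by 100 for percentages.
label_pct = target.value_counts(normalize=True) * 100
for label, pct in label_pct.items():
    print(f"Percentage of {label} labeled rows is {round(pct, 2)} %")
```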
# Generating bureau merged categorical dataframe
bureau_categorical_col=["TARGET","CREDIT_ACTIVE","CREDIT_CURRENCY","CNT_CREDIT_PROLONG","CREDIT_TYPE"]
bureau_merged_cat=bureau_merged[bureau_categorical_col]
# Phi-k correaltion matrix between categorical variables and target label
phik_overview = bureau_merged_cat.phik_matrix()
phik_overview.round(2)
# Plotting Phi-k correlation matrix
mask_array = np.ones(phik_overview.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.color_palette("Blues")
sns.heatmap(phik_overview,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title(r"Correlation $\phi_K$ for bureau categorical variables")
plt.show()
print("-"*100)
bureau_top_categorical_var_lst=phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between categorical variable and Target variable sorted in descending order are :\n")
phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
interval columns not set, guessing: ['TARGET', 'CNT_CREDIT_PROLONG']
---------------------------------------------------------------------------------------------------- Correlation between categorical variable and Target variable sorted in descending order are :
TARGET 1.000000 CREDIT_ACTIVE 0.064481 CREDIT_TYPE 0.049954 CREDIT_CURRENCY 0.004993 CNT_CREDIT_PROLONG 0.003862 Name: TARGET, dtype: float64
# Let's similarly observe the correlation between continuous variables and the Target variable using the Pearson correlation coefficient
bureau_categorical_col.remove('TARGET')
bureau_continuous=bureau_merged.drop(bureau_categorical_col,axis=1)
bureau_continuous=bureau_continuous.drop(["SK_ID_CURR","SK_ID_BUREAU"],axis=1)
# Computing the Pearson correlation coefficients
corr=bureau_continuous.corr()
# Plotting the correlation matrix
mask_array = np.ones(corr.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(10, 10))
cmap = sns.color_palette("Blues")
sns.heatmap(corr,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title("Correlation Pearson for Bureau continuous variables")
plt.show()
print("-"*100)
bureau_top_continuous_var_lst=corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between continuous variable and Target variable sorted in descending order are :\n")
corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
---------------------------------------------------------------------------------------------------- Correlation between continuous variable and Target variable sorted in descending order are :
TARGET 1.000000 DAYS_CREDIT 0.061556 DAYS_CREDIT_UPDATE 0.041076 DAYS_ENDDATE_FACT 0.039057 DAYS_CREDIT_ENDDATE 0.026497 AMT_CREDIT_SUM 0.010606 AMT_CREDIT_SUM_OVERDUE 0.006253 AMT_CREDIT_SUM_LIMIT 0.005990 CREDIT_DAY_OVERDUE 0.002652 AMT_CREDIT_SUM_DEBT 0.002539 AMT_CREDIT_MAX_OVERDUE 0.001587 Name: TARGET, dtype: float64
Observations and Conclusions:
# CREDIT_ACTIVE
category_var_plot(bureau_merged,bureau_top_categorical_var_lst[1])
Number of unique category values of CREDIT_ACTIVE categorical variable are 4 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for CREDIT_ACTIVE categorical variable are 2.92 % Percentage of 0 labels for NaN value row is 89.88 % Percentage of 1 labels for NaN value row is 10.12 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# CREDIT_TYPE
category_var_plot(bureau_merged,bureau_top_categorical_var_lst[2])
Number of unique category values of CREDIT_TYPE categorical variable are 15 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for CREDIT_TYPE categorical variable are 2.92 % Percentage of 0 labels for NaN value row is 89.88 % Percentage of 1 labels for NaN value row is 10.12 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# CREDIT_CURRENCY
category_var_plot(bureau_merged,bureau_top_categorical_var_lst[3])
# CNT_CREDIT_PROLONG
category_var_plot(bureau_merged,bureau_top_categorical_var_lst[4])
Number of unique category values of CREDIT_CURRENCY categorical variable are 4 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for CREDIT_CURRENCY categorical variable are 2.92 % Percentage of 0 labels for NaN value row is 89.88 % Percentage of 1 labels for NaN value row is 10.12 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of unique category values of CNT_CREDIT_PROLONG categorical variable are 10 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for CNT_CREDIT_PROLONG categorical variable are 2.92 % Percentage of 0 labels for NaN value row is 89.88 % Percentage of 1 labels for NaN value row is 10.12 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
This variable represents how many days before the application the client applied for the Credit Bureau credit.
DAYS_CREDIT = credit bureau application date - application date
For easier interpretation we will convert the data into positive year values.
# DAYS_CREDIT to YEARS_CREDIT
bureau_merged["YEARS_CREDIT"]=bureau_merged[bureau_top_continuous_var_lst[1]]*-1/365
plot_continuous_variables(bureau_merged,"YEARS_CREDIT")
bureau_merged=bureau_merged.drop(["YEARS_CREDIT"],axis=1)
Continuous variable YEARS_CREDIT has 44020 NaN rows. Percentage of NaN rows for YEARS_CREDIT continuous variable are 2.92 % Percentage of 0 labels for NaN value row is 89.88 % Percentage of 1 labels for NaN value row is 10.12 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable represents how many days before the application the last information about the Credit Bureau credit was received.
For easier interpretation we will convert the data into positive month values.
# DAYS_CREDIT_UPDATE to Months_CREDIT_UPDATE
bureau_merged["Months_CREDIT_UPDATE"]=bureau_merged[bureau_top_continuous_var_lst[2]]*-1/30
plot_continuous_variables(bureau_merged,"Months_CREDIT_UPDATE","Yes",(-20,150),(-20,150))
bureau_merged=bureau_merged.drop(["Months_CREDIT_UPDATE"],axis=1)
Continuous variable Months_CREDIT_UPDATE has 44020 NaN rows. Percentage of NaN rows for Months_CREDIT_UPDATE continuous variable are 2.92 % Percentage of 0 labels for NaN value row is 89.88 % Percentage of 1 labels for NaN value row is 10.12 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable represents how many days before the application the Credit Bureau credit ended.
For easier interpretation we will convert the data into positive month values.
# DAYS_ENDDATE_FACT to Months_ENDDATE_FACT
bureau_merged["Months_ENDDATE_FACT"]=bureau_merged[bureau_top_continuous_var_lst[3]]*-1/30
plot_continuous_variables(bureau_merged,"Months_ENDDATE_FACT","Yes",(-20,150),(0,150))
bureau_merged=bureau_merged.drop(["Months_ENDDATE_FACT"],axis=1)
Continuous variable Months_ENDDATE_FACT has 588693 NaN rows. Percentage of NaN rows for Months_ENDDATE_FACT continuous variable are 39.0 % Percentage of 0 labels for NaN value row is 90.64 % Percentage of 1 labels for NaN value row is 9.36 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable represents how many days of the Credit Bureau credit remained at the time of application.
For easier interpretation we will convert the data into month values (future end dates become positive, past ones negative).
# DAYS_CREDIT_ENDDATE to MONTHS_CREDIT_ENDDATE
bureau_merged["MONTHS_CREDIT_ENDDATE"]=bureau_merged[bureau_top_continuous_var_lst[4]]/30
plot_continuous_variables(bureau_merged,"MONTHS_CREDIT_ENDDATE","Yes",(-150,500),(-200,200))
bureau_merged=bureau_merged.drop(["MONTHS_CREDIT_ENDDATE"],axis=1)
Continuous variable MONTHS_CREDIT_ENDDATE has 133118 NaN rows. Percentage of NaN rows for MONTHS_CREDIT_ENDDATE continuous variable are 8.82 % Percentage of 0 labels for NaN value row is 89.66 % Percentage of 1 labels for NaN value row is 10.34 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
# Select the relevant columns and copy to avoid a pandas SettingWithCopyWarning
df=bureau_merged[["TARGET","DAYS_CREDIT","DAYS_CREDIT_UPDATE","DAYS_ENDDATE_FACT"]].copy()
df["YEARS_CREDIT"]=df["DAYS_CREDIT"]*-1/365
df["MONTHS_CREDIT_UPDATE"]=df["DAYS_CREDIT_UPDATE"]*-1/30
df["MONTHS_ENDDATE_FACT"]=df["DAYS_ENDDATE_FACT"]*-1/30
df=df.drop(["DAYS_CREDIT","DAYS_CREDIT_UPDATE","DAYS_ENDDATE_FACT"],axis=1)
df = df[df['MONTHS_CREDIT_UPDATE'] >= -20]
df = df[df['MONTHS_CREDIT_UPDATE'] <= 150]
df = df[df['MONTHS_ENDDATE_FACT'] >= 0]
df = df[df['MONTHS_ENDDATE_FACT'] <= 150]
pairPlotFn(df,"Bureau")
# Select the relevant columns and copy to avoid a pandas SettingWithCopyWarning
df=bureau_merged[["TARGET","DAYS_CREDIT","DAYS_CREDIT_UPDATE","DAYS_ENDDATE_FACT"]].copy()
df["YEARS_CREDIT"]=df["DAYS_CREDIT"]*-1/365
df["MONTHS_CREDIT_UPDATE"]=df["DAYS_CREDIT_UPDATE"]*-1/30
df["MONTHS_ENDDATE_FACT"]=df["DAYS_ENDDATE_FACT"]*-1/30
df=df.drop(["DAYS_CREDIT","DAYS_CREDIT_UPDATE","DAYS_ENDDATE_FACT"],axis=1)
df = df[df['MONTHS_CREDIT_UPDATE'] >= -20]
df = df[df['MONTHS_CREDIT_UPDATE'] <= 150]
df = df[df['MONTHS_ENDDATE_FACT'] >= 0]
df = df[df['MONTHS_ENDDATE_FACT'] <= 150]
pairPlotFn(df,"Bureau Merged")
Observations and Conclusions:
bureau_balance=pd.read_csv("bureau_balance.csv")
bureau_balance
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
| ... | ... | ... | ... |
| 27299920 | 5041336 | -47 | X |
| 27299921 | 5041336 | -48 | X |
| 27299922 | 5041336 | -49 | X |
| 27299923 | 5041336 | -50 | X |
| 27299924 | 5041336 | -51 | X |
27299925 rows × 3 columns
print("Number of unique values of SK_ID_BUREAU in bureau balance are ",bureau_balance["SK_ID_BUREAU"].nunique())
com_len=len(list(set(bureau_balance["SK_ID_BUREAU"]) & set(bureau["SK_ID_BUREAU"])))
print("Common id's between bureau and bureau balance are ",com_len)
print("Corresponding percentage is ",round(com_len/len(set(bureau["SK_ID_BUREAU"]))*100,2))
print("-"*100)
print("Unique values of the STATUS variable are \n",bureau_balance["STATUS"].value_counts().index.tolist())
Number of unique values of SK_ID_BUREAU in bureau balance are 817395 Common id's between bureau and bureau balance are 774354 Corresponding percentage is 45.11 ---------------------------------------------------------------------------------------------------- Unique values of the STATUS variable are ['C', '0', 'X', '1', '5', '2', '3', '4']
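The same overlap can be computed without materialising Python sets, using pandas `isin`; a minimal sketch on toy frames (the data here is illustrative, not from the actual files):

```python
import pandas as pd

# Toy stand-ins for bureau and bureau_balance
bureau = pd.DataFrame({"SK_ID_BUREAU": [1, 2, 3, 4]})
bureau_balance = pd.DataFrame({"SK_ID_BUREAU": [2, 3, 5, 5]})

# Unique bureau_balance ids that also appear in bureau
common = bureau_balance.loc[
    bureau_balance["SK_ID_BUREAU"].isin(bureau["SK_ID_BUREAU"]), "SK_ID_BUREAU"
].nunique()
coverage = round(common / bureau["SK_ID_BUREAU"].nunique() * 100, 2)
print(common, coverage)  # 2 ids shared, 50.0 % of bureau ids covered
```

On the full dataframes this avoids building two large sets in memory and stays vectorised.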
plot_nan_pct(bureau_balance,"Bureau Balance")
Dataframe Bureau Balance does not have any NaN variable
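Since bureau_balance holds one row per credit per month (MONTHS_BALANCE counts backwards from 0), the length of reported history per credit can be sketched with a simple groupby; the data below is a toy illustration:

```python
import pandas as pd

bb = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 10, 20, 20],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "0", "X", "1", "C"],
})
# Number of monthly snapshots reported for each credit
history_len = bb.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"].size()
print(history_len.to_dict())  # {10: 3, 20: 2}
```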
Observations and Conclusions:
This dataframe contains all previous Home Credit loan applications of the clients present in our application data.
previous_application=pd.read_csv("previous_application.csv")
previous_application
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.000000 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1670209 | 2300464 | 352015 | Consumer loans | 14704.290 | 267295.5 | 311400.0 | 0.0 | 267295.5 | WEDNESDAY | 12 | Y | 1 | 0.000000 | NaN | NaN | XAP | Approved | -544 | Cash through the bank | XAP | NaN | Refreshed | Furniture | POS | XNA | Stone | 43 | Furniture | 30.0 | low_normal | POS industry with interest | 365243.0 | -508.0 | 362.0 | -358.0 | -351.0 | 0.0 |
| 1670210 | 2357031 | 334635 | Consumer loans | 6622.020 | 87750.0 | 64291.5 | 29250.0 | 87750.0 | TUESDAY | 15 | Y | 1 | 0.340554 | NaN | NaN | XAP | Approved | -1694 | Cash through the bank | XAP | Unaccompanied | New | Furniture | POS | XNA | Stone | 43 | Furniture | 12.0 | middle | POS industry with interest | 365243.0 | -1604.0 | -1274.0 | -1304.0 | -1297.0 | 0.0 |
| 1670211 | 2659632 | 249544 | Consumer loans | 11520.855 | 105237.0 | 102523.5 | 10525.5 | 105237.0 | MONDAY | 12 | Y | 1 | 0.101401 | NaN | NaN | XAP | Approved | -1488 | Cash through the bank | XAP | Spouse, partner | Repeater | Consumer Electronics | POS | XNA | Country-wide | 1370 | Consumer electronics | 10.0 | low_normal | POS household with interest | 365243.0 | -1457.0 | -1187.0 | -1187.0 | -1181.0 | 0.0 |
| 1670212 | 2785582 | 400317 | Cash loans | 18821.520 | 180000.0 | 191880.0 | NaN | 180000.0 | WEDNESDAY | 9 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -1185 | Cash through the bank | XAP | Family | Repeater | XNA | Cash | x-sell | AP+ (Cash loan) | -1 | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -1155.0 | -825.0 | -825.0 | -817.0 | 1.0 |
| 1670213 | 2418762 | 261212 | Cash loans | 16431.300 | 360000.0 | 360000.0 | NaN | 360000.0 | SUNDAY | 10 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -1193 | Cash through the bank | XAP | Family | Repeater | XNA | Cash | x-sell | AP+ (Cash loan) | -1 | XNA | 48.0 | middle | Cash X-Sell: middle | 365243.0 | -1163.0 | 247.0 | -443.0 | -423.0 | 0.0 |
1670214 rows × 37 columns
print("Shape of previous application dataframe is ",previous_application.shape)
print("Number of unique values of SK_ID_PREV in previous application are ",previous_application["SK_ID_PREV"].nunique())
# Let's look at the common IDs between application_train.csv and previous_application
com_len_train=len(list(set(application_train["SK_ID_CURR"]) & set(previous_application["SK_ID_CURR"])))
print("Common id's between application_train and previous_application are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(application_train["SK_ID_CURR"]))*100,2))
Shape of previous application dataframe is (1670214, 37) Number of unique values of SK_ID_PREV in previous application are 1670214 Common id's between application_train and previous_application are 291057 Corresponding percentage is 94.65
# Percentage of NaN values per variable in the dataframe
plot_nan_pct(previous_application,"Previous Application")
Number of variables having NaN samples are 15
Observations and Conclusions:
Let's assign the "TARGET" variable to each previous application sample. We will use the SK_ID_CURR values common to application_train and previous_application to assign the target label, and remove the samples for which no target label gets assigned.
After merging we will examine the correlation of the target variable with the categorical variables and with the continuous variables.
print("Merging TARGET with Previous application dataframe")
previous_application_merged = application_train.iloc[:,:2].merge(previous_application, on = 'SK_ID_CURR', how = 'left')
print("Percentage of {} labels is {} %".format(previous_application_merged["TARGET"].value_counts().index.tolist()[0],round(previous_application_merged["TARGET"].value_counts()[0]/previous_application_merged.shape[0]*100,2)))
print("Percentage of {} labels is {} %".format(previous_application_merged["TARGET"].value_counts().index.tolist()[1],round(previous_application_merged["TARGET"].value_counts()[1]/previous_application_merged.shape[0]*100,2)))
print("Applicants labelled 1 are defaulters.")
print("-"*100)
previous_application_merged.head()
Merging TARGET with Previous application dataframe Percentage of 0 labels is 91.38 % Percentage of 1 labels is 8.62 % Applicants labelled 1 are defaulters. ----------------------------------------------------------------------------------------------------
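With a left merge every application row keeps its TARGET, while applicants with no previous application get NaN in the previous-application columns; those rows can be located and dropped via SK_ID_PREV, as sketched below on toy frames (names are illustrative):

```python
import pandas as pd

app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 3], "SK_ID_PREV": [11, 12, 31]})

merged = app.merge(prev, on="SK_ID_CURR", how="left")
# Applicant 2 has no previous application, so its SK_ID_PREV is NaN
no_history = merged["SK_ID_PREV"].isna()
merged_clean = merged[~no_history]
print(len(merged), len(merged_clean))  # 4 rows merged, 3 after dropping
```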
| | SK_ID_CURR | TARGET | SK_ID_PREV | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 1038818.0 | Consumer loans | 9251.775 | 179055.0 | 179055.0 | 0.0 | 179055.0 | SATURDAY | 9.0 | Y | 1.0 | 0.000000 | NaN | NaN | XAP | Approved | -606.0 | XNA | XAP | NaN | New | Vehicles | POS | XNA | Stone | 500.0 | Auto technology | 24.0 | low_normal | POS other with interest | 365243.0 | -565.0 | 125.0 | -25.0 | -17.0 | 0.0 |
| 1 | 100003 | 0 | 1810518.0 | Cash loans | 98356.995 | 900000.0 | 1035882.0 | NaN | 900000.0 | FRIDAY | 12.0 | Y | 1.0 | NaN | NaN | NaN | XNA | Approved | -746.0 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1.0 | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -716.0 | -386.0 | -536.0 | -527.0 | 1.0 |
| 2 | 100003 | 0 | 2636178.0 | Consumer loans | 64567.665 | 337500.0 | 348637.5 | 0.0 | 337500.0 | SUNDAY | 17.0 | Y | 1.0 | 0.000000 | NaN | NaN | XAP | Approved | -828.0 | Cash through the bank | XAP | Family | Refreshed | Furniture | POS | XNA | Stone | 1400.0 | Furniture | 6.0 | middle | POS industry with interest | 365243.0 | -797.0 | -647.0 | -647.0 | -639.0 | 0.0 |
| 3 | 100003 | 0 | 2396755.0 | Consumer loans | 6737.310 | 68809.5 | 68053.5 | 6885.0 | 68809.5 | SATURDAY | 15.0 | Y | 1.0 | 0.100061 | NaN | NaN | XAP | Approved | -2341.0 | Cash through the bank | XAP | Family | Refreshed | Consumer Electronics | POS | XNA | Country-wide | 200.0 | Consumer electronics | 12.0 | middle | POS household with interest | 365243.0 | -2310.0 | -1980.0 | -1980.0 | -1976.0 | 1.0 |
| 4 | 100004 | 0 | 1564014.0 | Consumer loans | 5357.250 | 24282.0 | 20106.0 | 4860.0 | 24282.0 | FRIDAY | 5.0 | Y | 1.0 | 0.212008 | NaN | NaN | XAP | Approved | -815.0 | Cash through the bank | XAP | Unaccompanied | New | Mobile | POS | XNA | Regional / Local | 30.0 | Connectivity | 4.0 | middle | POS mobile without interest | 365243.0 | -784.0 | -694.0 | -724.0 | -714.0 | 0.0 |
# Generating Previous Application categorical dataframe
previous_application_categorical_col=["TARGET","NAME_CONTRACT_TYPE","WEEKDAY_APPR_PROCESS_START","HOUR_APPR_PROCESS_START","FLAG_LAST_APPL_PER_CONTRACT","NFLAG_LAST_APPL_IN_DAY",\
"NAME_CASH_LOAN_PURPOSE","NAME_CONTRACT_STATUS","NAME_PAYMENT_TYPE","CODE_REJECT_REASON","NAME_TYPE_SUITE",\
"NAME_CLIENT_TYPE","NAME_GOODS_CATEGORY","NAME_PORTFOLIO","NAME_PRODUCT_TYPE","CHANNEL_TYPE","NAME_SELLER_INDUSTRY",\
"CNT_PAYMENT","NAME_YIELD_GROUP","PRODUCT_COMBINATION","NFLAG_INSURED_ON_APPROVAL"]
previous_application_merged_cat=previous_application_merged[previous_application_categorical_col]
# Phi-k correlation matrix between categorical variables and the target label
phik_overview = previous_application_merged_cat.phik_matrix()
phik_overview.round(2)
# Plotting Phi-k correlation matrix
mask_array = np.ones(phik_overview.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(12, 12))
cmap = sns.color_palette("Blues")
sns.heatmap(phik_overview,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title(r"Correlation $\phi_K$ for Previous Application categorical variables")
plt.show()
print("-"*100)
previous_application_top_categorical_var_lst=phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between categorical variable and Target variable sorted in descending order are :\n")
phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
interval columns not set, guessing: ['TARGET', 'HOUR_APPR_PROCESS_START', 'NFLAG_LAST_APPL_IN_DAY', 'CNT_PAYMENT', 'NFLAG_INSURED_ON_APPROVAL']
---------------------------------------------------------------------------------------------------- Correlation between categorical variable and Target variable sorted in descending order are :
TARGET 1.000000 NAME_CONTRACT_STATUS 0.088266 PRODUCT_COMBINATION 0.063839 CODE_REJECT_REASON 0.062771 CNT_PAYMENT 0.056639 NAME_CONTRACT_TYPE 0.050859 CHANNEL_TYPE 0.050302 NAME_GOODS_CATEGORY 0.042951 NAME_CASH_LOAN_PURPOSE 0.040305 NAME_PAYMENT_TYPE 0.039752 HOUR_APPR_PROCESS_START 0.038121 Name: TARGET, dtype: float64
# Let's similarly observe the correlation between continuous variables and the Target variable using the Pearson correlation coefficient
previous_application_categorical_col.remove('TARGET')
previous_application_continuous=previous_application_merged.drop(previous_application_categorical_col,axis=1)
previous_application_continuous=previous_application_continuous.drop(["SK_ID_CURR","SK_ID_PREV"],axis=1)
# Computing the Pearson correlation coefficients
corr=previous_application_continuous.corr()
# Plotting the correlation matrix
mask_array = np.ones(corr.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(12, 12))
cmap = sns.color_palette("Blues")
sns.heatmap(corr,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title("Pearson correlation for Previous Application continuous variables")
plt.show()
print("-"*100)
previous_application_top_continuous_var_lst=corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between continuous variable and Target variable sorted in descending order are :\n")
corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
---------------------------------------------------------------------------------------------------- Correlation between continuous variable and Target variable sorted in descending order are :
TARGET 1.000000 DAYS_DECISION 0.039901 DAYS_FIRST_DRAWING 0.031154 RATE_INTEREST_PRIVILEGED 0.028640 RATE_DOWN_PAYMENT 0.026111 DAYS_LAST_DUE_1ST_VERSION 0.018021 DAYS_LAST_DUE 0.017522 DAYS_TERMINATION 0.016981 AMT_DOWN_PAYMENT 0.016918 AMT_ANNUITY 0.014922 DAYS_FIRST_DUE 0.006651 Name: TARGET, dtype: float64
Observations and Conclusions:
# NAME_CONTRACT_STATUS
category_var_plot(previous_application_merged,previous_application_top_categorical_var_lst[1])
Number of unique category values of NAME_CONTRACT_STATUS categorical variable are 4 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for NAME_CONTRACT_STATUS categorical variable are 1.15 % Percentage of 0 labels for NaN value row is 94.04 % Percentage of 1 labels for NaN value row is 5.96 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# PRODUCT_COMBINATION
category_var_plot(previous_application_merged,previous_application_top_categorical_var_lst[2])
Number of unique category values of PRODUCT_COMBINATION categorical variable are 17 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for PRODUCT_COMBINATION categorical variable are 1.17 % Percentage of 0 labels for NaN value row is 93.78 % Percentage of 1 labels for NaN value row is 6.22 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# CODE_REJECT_REASON
category_var_plot(previous_application_merged,previous_application_top_categorical_var_lst[3])
Number of unique category values of CODE_REJECT_REASON categorical variable are 9 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for CODE_REJECT_REASON categorical variable are 1.15 % Percentage of 0 labels for NaN value row is 94.04 % Percentage of 1 labels for NaN value row is 5.96 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# CNT_PAYMENT
category_var_plot(previous_application_merged,previous_application_top_categorical_var_lst[4])
Number of unique category values of CNT_PAYMENT categorical variable are 48 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for CNT_PAYMENT categorical variable are 22.63 % Percentage of 0 labels for NaN value row is 90.45 % Percentage of 1 labels for NaN value row is 9.55 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
# NAME_CONTRACT_TYPE
category_var_plot(previous_application_merged,previous_application_top_categorical_var_lst[5])
Number of unique category values of NAME_CONTRACT_TYPE categorical variable are 4 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Dataframe has NaN rows. Percentage of NaN rows for NAME_CONTRACT_TYPE categorical variable are 1.15 % Percentage of 0 labels for NaN value row is 94.04 % Percentage of 1 labels for NaN value row is 5.96 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and conclusions :
This variable represents how many days before the current application the decision about the previous application was made.
DAYS_DECISION = Previous application decision date - application date
For easier interpretation we will convert the data into positive year values.
# DAYS_DECISION to YEARS_DECISION
previous_application_merged["YEARS_DECISION"]=previous_application_merged[previous_application_top_continuous_var_lst[1]]*-1/365
plot_continuous_variables(previous_application_merged,"YEARS_DECISION")
previous_application_merged=previous_application_merged.drop(["YEARS_DECISION"],axis=1)
Continuous variable YEARS_DECISION has 16454 NaN rows. Percentage of NaN rows for YEARS_DECISION continuous variable are 1.15 % Percentage of 0 labels for NaN value row is 94.04 % Percentage of 1 labels for NaN value row is 5.96 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable represents how many days before the current application the first disbursement of the previous application happened.
DAYS_FIRST_DRAWING = Previous application disbursement date - application date
For easier interpretation we will convert the data into year values.
# DAYS_FIRST_DRAWING to YEARS_FIRST_DRAWING
previous_application_merged["YEARS_FIRST_DRAWING"]=previous_application_merged[previous_application_top_continuous_var_lst[2]]/365
plot_continuous_variables(previous_application_merged,"YEARS_FIRST_DRAWING","Yes",(900,1100),(900,1100))
previous_application_merged=previous_application_merged.drop(["YEARS_FIRST_DRAWING"],axis=1)
Continuous variable YEARS_FIRST_DRAWING has 577560 NaN rows. Percentage of NaN rows for YEARS_FIRST_DRAWING continuous variable are 40.38 % Percentage of 0 labels for NaN value row is 89.85 % Percentage of 1 labels for NaN value row is 10.15 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
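The cluster around 1000 years (365243 / 365) comes from the value 365243, which this dataset uses as a placeholder in the DAYS_* columns of previous_application. One common treatment, sketched here on a toy series, is to replace the placeholder with NaN before converting to years:

```python
import numpy as np
import pandas as pd

days = pd.Series([365243.0, -565.0, -716.0, 365243.0])
# Treat the 365243 placeholder as missing before converting to years
years = days.replace(365243.0, np.nan) / 365
print(years.isna().sum())  # 2 placeholder entries become NaN
```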
# RATE_INTEREST_PRIVILEGED
plot_continuous_variables(previous_application_merged,previous_application_top_continuous_var_lst[3])
Continuous variable RATE_INTEREST_PRIVILEGED has 1425364 NaN rows. Percentage of NaN rows for RATE_INTEREST_PRIVILEGED continuous variable are 99.67 % Percentage of 0 labels for NaN value row is 91.37 % Percentage of 1 labels for NaN value row is 8.63 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
# RATE_DOWN_PAYMENT
plot_continuous_variables(previous_application_merged,previous_application_top_continuous_var_lst[4])
Continuous variable RATE_DOWN_PAYMENT has 765994 NaN rows. Percentage of NaN rows for RATE_DOWN_PAYMENT continuous variable are 53.56 % Percentage of 0 labels for NaN value row is 90.53 % Percentage of 1 labels for NaN value row is 9.47 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations and Conclusions :
This variable represents how many days before the current application the first due date of the previous application was supposed to occur.
DAYS_LAST_DUE_1ST_VERSION = Previous application's first due date - application date
For easier interpretation we will convert the data into year values.
# DAYS_LAST_DUE_1ST_VERSION to Years_LAST_DUE_1ST_VERSION
previous_application_merged["Years_LAST_DUE_1ST_VERSION"]=previous_application_merged[previous_application_top_continuous_var_lst[5]]*-1/365
print_percentiles(previous_application_merged,"Years_LAST_DUE_1ST_VERSION",percentiles=[0,9,10,25,50,75,90,100])
plot_continuous_variables(previous_application_merged,"Years_LAST_DUE_1ST_VERSION","Yes",(-1200,200),(-10,10))
previous_application_merged=previous_application_merged.drop(["Years_LAST_DUE_1ST_VERSION"],axis=1)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- The 0th percentile value of Years_LAST_DUE_1ST_VERSION is -1000.6657534246575 The 9th percentile value of Years_LAST_DUE_1ST_VERSION is -1000.6657534246575 The 10th percentile value of Years_LAST_DUE_1ST_VERSION is -3.7643835616438355 The 25th percentile value of Years_LAST_DUE_1ST_VERSION is -0.3698630136986301 The 50th percentile value of Years_LAST_DUE_1ST_VERSION is 0.9808219178082191 The 75th percentile value of Years_LAST_DUE_1ST_VERSION is 3.408219178082192 The 90th percentile value of Years_LAST_DUE_1ST_VERSION is 5.602739726027397 The 100th percentile value of Years_LAST_DUE_1ST_VERSION is 7.673972602739726 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Continuous variable Years_LAST_DUE_1ST_VERSION has 577560 NaN rows. Percentage of NaN rows for Years_LAST_DUE_1ST_VERSION continuous variable are 40.38 % Percentage of 0 labels for NaN value row is 89.85 % Percentage of 1 labels for NaN value row is 10.15 % Applicants labelled 1 are defaulters. -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
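`print_percentiles` is a helper defined earlier in the notebook; a minimal equivalent, assuming it simply reports `numpy.nanpercentile` values for the given column, might look like:

```python
import numpy as np
import pandas as pd

def print_percentiles_sketch(df, col, percentiles=(0, 25, 50, 75, 100)):
    """Report the requested percentiles of a column, ignoring NaNs."""
    values = df[col].to_numpy(dtype=float)
    out = {}
    for p in percentiles:
        out[p] = np.nanpercentile(values, p)
        print(f"The {p}th percentile value of {col} is {out[p]}")
    return out

# Toy demonstration frame
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan]})
res = print_percentiles_sketch(demo, "x")
```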
Observations and Conclusions :
# Keep only the relevant columns; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df=previous_application_merged[["TARGET","DAYS_DECISION","DAYS_FIRST_DRAWING","RATE_INTEREST_PRIVILEGED"]].copy()
df["YEARS_DECISION"]=df["DAYS_DECISION"]*-1/365
df["YEARS_FIRST_DRAWING"]=df["DAYS_FIRST_DRAWING"]/365
df=df.drop(["DAYS_DECISION","DAYS_FIRST_DRAWING"],axis=1)
# Restrict YEARS_FIRST_DRAWING to the 900-1100 band (the 365243-day placeholder maps to ~1000.7 years)
df=df[(df["YEARS_FIRST_DRAWING"]>=900) & (df["YEARS_FIRST_DRAWING"]<=1100)]
pairPlotFn(df,"Previous Application")
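The ~1000-year values seen in the percentile output above come from a placeholder date in the DAYS_* columns: 365243 days ≈ 1000.66 years, matching the 0th-percentile value printed earlier. A minimal sketch (on a toy frame, not the real data) of replacing that placeholder with NaN before converting days to years:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the DAYS_FIRST_DRAWING column of the real data
df = pd.DataFrame({"DAYS_FIRST_DRAWING": [-1187.0, 365243.0, -63.0]})

# 365243 days ~ 1000.66 years: treat it as "no first drawing" rather than a real date
df["DAYS_FIRST_DRAWING"] = df["DAYS_FIRST_DRAWING"].replace(365243.0, np.nan)
df["YEARS_FIRST_DRAWING"] = df["DAYS_FIRST_DRAWING"] / 365

print(df["YEARS_FIRST_DRAWING"].round(2).tolist())  # [-3.25, nan, -0.17]
```

With the sentinel mapped to NaN, percentiles and plots reflect only genuine dates; the NaN share can then be reported separately, as the helper functions in this notebook already do.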
Observations and Conclusions:
installments_payments=pd.read_csv("installments_payments.csv")
installments_payments
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13605396 | 2186857 | 428057 | 0.0 | 66 | -1624.0 | NaN | 67.500 | NaN |
| 13605397 | 1310347 | 414406 | 0.0 | 47 | -1539.0 | NaN | 67.500 | NaN |
| 13605398 | 1308766 | 402199 | 0.0 | 43 | -7.0 | NaN | 43737.435 | NaN |
| 13605399 | 1062206 | 409297 | 0.0 | 43 | -1986.0 | NaN | 67.500 | NaN |
| 13605400 | 2448869 | 434321 | 1.0 | 19 | -27.0 | NaN | 11504.250 | NaN |
13605401 rows × 8 columns
print("Shape of Installments Payments dataframe is ",installments_payments.shape)
print("Number of unique values of SK_ID_PREV in Installments Payments are ",installments_payments["SK_ID_PREV"].nunique())
# Let's look at the common IDs between application_train.csv and Installments Payments
com_len_train=len(list(set(application_train["SK_ID_CURR"]) & set(installments_payments["SK_ID_CURR"])))
print("Common id's between application_train and installments_payments are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(application_train["SK_ID_CURR"]))*100,2))
print("-"*185)
# Let's look at the common IDs between previous applications and Installments Payments
com_len_train=len(list(set(previous_application["SK_ID_PREV"]) & set(installments_payments["SK_ID_PREV"])))
print("Common id's between previous_application and installments_payments are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(previous_application["SK_ID_PREV"]))*100,2))
Shape of Installments Payments dataframe is (13605401, 8)
Number of unique values of SK_ID_PREV in Installments Payments are 997752
Common id's between application_train and installments_payments are 291643
Corresponding percentage is 94.84
Common id's between previous_application and installments_payments are 958905
Corresponding percentage is 57.41
# NaN variable percentage in dataframe
plot_nan_pct(installments_payments,"Installments Payments")
Number of variables having NaN samples are 2
Observations and Conclusions:
Let's assign the "TARGET" variable to each installment payment sample. We will use the common SK_ID_CURR between application_train and installments_payments to assign the TARGET label, and remove the samples for which no TARGET label is assigned.
After merging we will understand the correlation between the target variable and the continuous variables.
print("Merging TARGET with Installment Payments dataframe")
installments_payments_merged = application_train.iloc[:,:2].merge(installments_payments, on = 'SK_ID_CURR', how = 'left')
print("Percentage of {} labels in merged dataframe is {} %".format(installments_payments_merged["TARGET"].value_counts().index.tolist()[0],round(installments_payments_merged["TARGET"].value_counts()[0]/installments_payments_merged.shape[0]*100,2)))
print("Percentage of {} labels in merged dataframe is {} %".format(installments_payments_merged["TARGET"].value_counts().index.tolist()[1],round(installments_payments_merged["TARGET"].value_counts()[1]/installments_payments_merged.shape[0]*100,2)))
print("Applicants labelled 1 are defaulters.")
print("-"*100)
installments_payments_merged.head()
Merging TARGET with Installment Payments dataframe
Percentage of 0 labels in merged dataframe is 92.41 %
Percentage of 1 labels in merged dataframe is 7.59 %
Applicants labelled 1 are defaulters.
| | SK_ID_CURR | TARGET | SK_ID_PREV | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 1038818.0 | 1.0 | 16.0 | -115.0 | -133.0 | 9251.775 | 9251.775 |
| 1 | 100002 | 1 | 1038818.0 | 1.0 | 8.0 | -355.0 | -375.0 | 9251.775 | 9251.775 |
| 2 | 100002 | 1 | 1038818.0 | 2.0 | 19.0 | -25.0 | -49.0 | 53093.745 | 53093.745 |
| 3 | 100002 | 1 | 1038818.0 | 1.0 | 12.0 | -235.0 | -250.0 | 9251.775 | 9251.775 |
| 4 | 100002 | 1 | 1038818.0 | 1.0 | 9.0 | -325.0 | -344.0 | 9251.775 | 9251.775 |
# Let's similarly observe the correlation between continuous variables and the Target variable using the Pearson correlation coefficient
installments_payments_continuous=installments_payments_merged.drop(["SK_ID_CURR","SK_ID_PREV"],axis=1)
# Computing the Pearson correlation coefficients
corr=installments_payments_continuous.corr()
# Plotting the correlation matrix
mask_array = np.ones(corr.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(12, 12))
cmap = sns.color_palette("Blues")
sns.heatmap(corr,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title("Correlation Pearson for Installments Payments continuous variables")
plt.show()
print("-"*100)
previous_application_top_continuous_var_lst=corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between continuous variable and Target variable sorted in descending order are :\n")
corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
Correlation between continuous variable and Target variable sorted in descending order are :

TARGET                    1.000000
DAYS_ENTRY_PAYMENT        0.035122
DAYS_INSTALMENT           0.034974
NUM_INSTALMENT_NUMBER     0.016190
NUM_INSTALMENT_VERSION    0.009896
AMT_PAYMENT               0.003623
AMT_INSTALMENT            0.001498
Name: TARGET, dtype: float64
Observations and Conclusions:
This variable represents how many days before the current application the actual installment payment was made.
DAYS_ENTRY_PAYMENT = previous credit's installment payment date - application date
For easier interpretation of the results we will convert the data to positive year values.
# DAYS_ENTRY_PAYMENT to YEARS_ENTRY_PAYMENT
installments_payments_merged["YEARS_ENTRY_PAYMENT"]=installments_payments_merged[previous_application_top_continuous_var_lst[1]]*-1/365
plot_continuous_variables(installments_payments_merged,"YEARS_ENTRY_PAYMENT")
installments_payments_merged=installments_payments_merged.drop(["YEARS_ENTRY_PAYMENT"],axis=1)
Continuous variable YEARS_ENTRY_PAYMENT has 18451 NaN rows.
Percentage of NaN rows for YEARS_ENTRY_PAYMENT continuous variable are 0.16 %
Percentage of 0 labels for NaN value row is 91.65 %
Percentage of 1 labels for NaN value row is 8.35 %
Applicants labelled 1 are defaulters.
Observations and Conclusions :
This variable represents how many days before the current application the installment payment was supposed to happen.
DAYS_INSTALMENT = previous credit's installment due date - application date
For easier interpretation of the results we will convert the data to positive year values.
# DAYS_INSTALMENT to YEARS_INSTALMENT
installments_payments_merged["YEARS_INSTALMENT"]=installments_payments_merged[previous_application_top_continuous_var_lst[2]]*-1/365
plot_continuous_variables(installments_payments_merged,"YEARS_INSTALMENT")
installments_payments_merged=installments_payments_merged.drop(["YEARS_INSTALMENT"],axis=1)
Continuous variable YEARS_INSTALMENT has 15868 NaN rows.
Percentage of NaN rows for YEARS_INSTALMENT continuous variable are 0.14 %
Percentage of 0 labels for NaN value row is 94.02 %
Percentage of 1 labels for NaN value row is 5.98 %
Applicants labelled 1 are defaulters.
Observations and Conclusions :
# Keep only the relevant columns; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df=installments_payments_merged[["TARGET","DAYS_ENTRY_PAYMENT","DAYS_INSTALMENT"]].copy()
df["YEARS_ENTRY_PAYMENT"]=df["DAYS_ENTRY_PAYMENT"]*-1/365
df["YEARS_INSTALMENT"]=df["DAYS_INSTALMENT"]*-1/365
df=df.drop(["DAYS_ENTRY_PAYMENT","DAYS_INSTALMENT"],axis=1)
pairPlotFn(df,"Installment Payments")
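Since DAYS_INSTALMENT is the scheduled payment date and DAYS_ENTRY_PAYMENT the actual one, their difference measures how late an installment was paid. As an illustrative sketch (the derived column name DAYS_PAST_DUE is our own, not part of the dataset):

```python
import pandas as pd

# Toy rows standing in for installments_payments_merged
df = pd.DataFrame({
    "DAYS_INSTALMENT":    [-1180.0, -2156.0, -1383.0],
    "DAYS_ENTRY_PAYMENT": [-1187.0, -2156.0, -1366.0],
})

# Positive => paid after the due date, 0 => on time, negative => paid early
df["DAYS_PAST_DUE"] = df["DAYS_ENTRY_PAYMENT"] - df["DAYS_INSTALMENT"]
print(df["DAYS_PAST_DUE"].tolist())  # [-7.0, 0.0, 17.0]
```

Aggregating such a delay per SK_ID_CURR is one plausible way to turn this table into applicant-level features, though the EDA here only inspects the raw columns.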
Observations and Conclusions:
POS_CASH_balance=pd.read_csv("POS_CASH_balance.csv")
POS_CASH_balance
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10001353 | 2448283 | 226558 | -20 | 6.0 | 0.0 | Active | 843 | 0 |
| 10001354 | 1717234 | 141565 | -19 | 12.0 | 0.0 | Active | 602 | 0 |
| 10001355 | 1283126 | 315695 | -21 | 10.0 | 0.0 | Active | 609 | 0 |
| 10001356 | 1082516 | 450255 | -22 | 12.0 | 0.0 | Active | 614 | 0 |
| 10001357 | 1259607 | 174278 | -52 | 16.0 | 0.0 | Completed | 0 | 0 |
10001358 rows × 8 columns
print("Shape of POS_CASH_balance dataframe is ",POS_CASH_balance.shape)
print("Number of unique values of SK_ID_PREV in POS_CASH_balance are ",POS_CASH_balance["SK_ID_PREV"].nunique())
# Let's look at the common IDs between application_train.csv and POS_CASH_balance
com_len_train=len(list(set(application_train["SK_ID_CURR"]) & set(POS_CASH_balance["SK_ID_CURR"])))
print("Common id's between application_train and POS_CASH_balance are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(application_train["SK_ID_CURR"]))*100,2))
print("-"*100)
# Let's look at the common IDs between previous applications and POS_CASH_balance
com_len_train=len(list(set(previous_application["SK_ID_PREV"]) & set(POS_CASH_balance["SK_ID_PREV"])))
print("Common id's between previous_application and POS_CASH_balance are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(previous_application["SK_ID_PREV"]))*100,2))
Shape of POS_CASH_balance dataframe is (10001358, 8)
Number of unique values of SK_ID_PREV in POS_CASH_balance are 936325
Common id's between application_train and POS_CASH_balance are 289444
Corresponding percentage is 94.12
Common id's between previous_application and POS_CASH_balance are 898903
Corresponding percentage is 53.82
# NaN variable percentage in dataframe
plot_nan_pct(POS_CASH_balance,"POS_CASH_balance")
Number of variables having NaN samples are 2
Observations and Conclusions:
Let's assign the "TARGET" variable to each POS_CASH_balance sample. We will use the common SK_ID_CURR between application_train and POS_CASH_balance to assign the TARGET label, and remove the samples for which no TARGET label is assigned.
After merging we will understand the correlation between the target variable and the continuous variables.
print("Merging TARGET with POS_CASH_balance dataframe")
POS_CASH_balance_merged = application_train.iloc[:,:2].merge(POS_CASH_balance, on = 'SK_ID_CURR', how = 'left')
print("Percentage of {} labels in merged dataframe is {} %".format(POS_CASH_balance_merged["TARGET"].value_counts().index.tolist()[0],round(POS_CASH_balance_merged["TARGET"].value_counts()[0]/POS_CASH_balance_merged.shape[0]*100,2)))
print("Percentage of {} labels in merged dataframe is {} %".format(POS_CASH_balance_merged["TARGET"].value_counts().index.tolist()[1],round(POS_CASH_balance_merged["TARGET"].value_counts()[1]/POS_CASH_balance_merged.shape[0]*100,2)))
print("Applicants labelled 1 are defaulters.")
print("-"*100)
POS_CASH_balance_merged.head()
Merging TARGET with POS_CASH_balance dataframe
Percentage of 0 labels in merged dataframe is 92.64 %
Percentage of 1 labels in merged dataframe is 7.36 %
Applicants labelled 1 are defaulters.
| | SK_ID_CURR | TARGET | SK_ID_PREV | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 1038818.0 | -13.0 | 24.0 | 18.0 | Active | 0.0 | 0.0 |
| 1 | 100002 | 1 | 1038818.0 | -16.0 | 24.0 | 21.0 | Active | 0.0 | 0.0 |
| 2 | 100002 | 1 | 1038818.0 | -18.0 | 24.0 | 23.0 | Active | 0.0 | 0.0 |
| 3 | 100002 | 1 | 1038818.0 | -17.0 | 24.0 | 22.0 | Active | 0.0 | 0.0 |
| 4 | 100002 | 1 | 1038818.0 | -5.0 | 24.0 | 10.0 | Active | 0.0 | 0.0 |
# Generating POS_CASH_balance categorical dataframe
POS_CASH_balance_categorical_col=["TARGET","NAME_CONTRACT_STATUS"]
POS_CASH_balance_merged_cat=POS_CASH_balance_merged[POS_CASH_balance_categorical_col]
# Phi-k correlation matrix between categorical variables and the target label
phik_overview = POS_CASH_balance_merged_cat.phik_matrix()
phik_overview.round(2)
POS_CASH_balance_top_categorical_var_lst=phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between categorical variable and Target variable sorted in descending order are :\n")
phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
interval columns not set, guessing: ['TARGET']
Correlation between categorical variable and Target variable sorted in descending order are :

TARGET                  1.000000
NAME_CONTRACT_STATUS    0.009566
Name: TARGET, dtype: float64
# Let's similarly observe the correlation between continuous variables and the Target variable using the Pearson correlation coefficient
POS_CASH_balance_continuous=POS_CASH_balance_merged.drop(["SK_ID_CURR","SK_ID_PREV","NAME_CONTRACT_STATUS"],axis=1)
# Computing the Pearson correlation coefficients
corr=POS_CASH_balance_continuous.corr()
# Plotting the correlation matrix
mask_array = np.ones(corr.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(12, 12))
cmap = sns.color_palette("Blues")
sns.heatmap(corr,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title("Correlation Pearson for POS_CASH_balance continuous variables")
plt.show()
print("-"*100)
previous_application_top_continuous_var_lst=corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between continuous variable and Target variable sorted in descending order are :\n")
corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
Correlation between continuous variable and Target variable sorted in descending order are :

TARGET                   1.000000
CNT_INSTALMENT_FUTURE    0.021972
MONTHS_BALANCE           0.020147
CNT_INSTALMENT           0.018506
SK_DPD                   0.009866
SK_DPD_DEF               0.008594
Name: TARGET, dtype: float64
Observations and Conclusions:
Amongst all continuous variables, CNT_INSTALMENT_FUTURE shows the highest (though still weak) correlation with the TARGET label.
We will plot the categorical variable and the top 2 continuous variables against the TARGET variable.
# NAME_CONTRACT_STATUS
category_var_plot(POS_CASH_balance_merged,POS_CASH_balance_top_categorical_var_lst[1])
Number of unique category values of NAME_CONTRACT_STATUS categorical variable are 9
Dataframe has NaN rows.
Percentage of NaN rows for NAME_CONTRACT_STATUS categorical variable are 0.21 %
Percentage of 0 labels for NaN value row is 93.34 %
Percentage of 1 labels for NaN value row is 6.66 %
Applicants labelled 1 are defaulters.
Observations and Conclusions:
plot_continuous_variables(POS_CASH_balance_merged,previous_application_top_continuous_var_lst[1])
Continuous variable CNT_INSTALMENT_FUTURE has 39945 NaN rows.
Percentage of NaN rows for CNT_INSTALMENT_FUTURE continuous variable are 0.47 %
Percentage of 0 labels for NaN value row is 91.7 %
Percentage of 1 labels for NaN value row is 8.3 %
Applicants labelled 1 are defaulters.
Observations and Conclusions :
plot_continuous_variables(POS_CASH_balance_merged,previous_application_top_continuous_var_lst[2])
Continuous variable MONTHS_BALANCE has 18067 NaN rows.
Percentage of NaN rows for MONTHS_BALANCE continuous variable are 0.21 %
Percentage of 0 labels for NaN value row is 93.34 %
Percentage of 1 labels for NaN value row is 6.66 %
Applicants labelled 1 are defaulters.
Observations and Conclusions :
df=POS_CASH_balance_merged[["TARGET","CNT_INSTALMENT_FUTURE","MONTHS_BALANCE"]]
pairPlotFn(df,"POS_CASH_BALANCE")
Observations and Conclusions:
credit_card_balance=pd.read_csv("credit_card_balance.csv")
credit_card_balance
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | 1800.00 | 1800.000 | 0.000 | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | 2250.00 | 2250.000 | 60175.080 | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | 2250.00 | 2250.000 | 26926.425 | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | 11925.00 | 11925.000 | 224949.285 | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | 27000.00 | 27000.000 | 443044.395 | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3840307 | 1036507 | 328243 | -9 | 0.000 | 45000 | NaN | 0.0 | NaN | NaN | 0.000 | NaN | 0.000 | 0.000 | 0.000 | 0.000 | NaN | 0 | NaN | NaN | 0.0 | Active | 0 | 0 |
| 3840308 | 1714892 | 347207 | -9 | 0.000 | 45000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | 1879.11 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 23.0 | Active | 0 | 0 |
| 3840309 | 1302323 | 215757 | -9 | 275784.975 | 585000 | 270000.0 | 270000.0 | 0.0 | 0.0 | 2250.000 | 375750.00 | 356994.675 | 269356.140 | 273093.975 | 273093.975 | 2.0 | 2 | 0.0 | 0.0 | 18.0 | Active | 0 | 0 |
| 3840310 | 1624872 | 430337 | -10 | 0.000 | 450000 | NaN | 0.0 | NaN | NaN | 0.000 | NaN | 0.000 | 0.000 | 0.000 | 0.000 | NaN | 0 | NaN | NaN | 0.0 | Active | 0 | 0 |
| 3840311 | 2411345 | 236760 | -10 | 0.000 | 157500 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | 0.00 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 21.0 | Completed | 0 | 0 |
3840312 rows × 23 columns
print("Shape of Credit card balance dataframe is ",credit_card_balance.shape)
print("Number of unique values of SK_ID_PREV in Credit card balance are ",credit_card_balance["SK_ID_PREV"].nunique())
# Let's look at the common IDs between application_train.csv and Credit card balance
com_len_train=len(list(set(application_train["SK_ID_CURR"]) & set(credit_card_balance["SK_ID_CURR"])))
print("Common id's between application_train and Credit card balance are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(application_train["SK_ID_CURR"]))*100,2))
print("-"*100)
# Let's look at the common IDs between previous applications and Credit card balance
com_len_train=len(list(set(previous_application["SK_ID_PREV"]) & set(credit_card_balance["SK_ID_PREV"])))
print("Common id's between previous_application and Credit card balance are ",com_len_train)
print("Corresponding percentage is ",round(com_len_train/len(set(previous_application["SK_ID_PREV"]))*100,2))
Shape of Credit card balance dataframe is (3840312, 23)
Number of unique values of SK_ID_PREV in Credit card balance are 104307
Common id's between application_train and Credit card balance are 86905
Corresponding percentage is 28.26
Common id's between previous_application and Credit card balance are 92935
Corresponding percentage is 5.56
# NaN variable percentage in dataframe
plot_nan_pct(credit_card_balance,"Credit card balance")
Number of variables having NaN samples are 9
Observations and Conclusions:
Let's assign the "TARGET" variable to each Credit card balance sample. We will use the common SK_ID_CURR between application_train and credit_card_balance to assign the TARGET label, and remove the samples for which no TARGET label is assigned.
After merging we will understand the correlation between the target variable and the continuous variables.
print("Merging TARGET with Credit card balance dataframe")
credit_card_balance_merged = application_train.iloc[:,:2].merge(credit_card_balance, on = 'SK_ID_CURR', how = 'left')
print("Percentage of {} labeled rows is {} %".format(credit_card_balance_merged["TARGET"].value_counts().index.tolist()[0],round(credit_card_balance_merged["TARGET"].value_counts()[0]/credit_card_balance_merged.shape[0]*100,2)))
print("Percentage of {} labeled rows is {} %".format(credit_card_balance_merged["TARGET"].value_counts().index.tolist()[1],round(credit_card_balance_merged["TARGET"].value_counts()[1]/credit_card_balance_merged.shape[0]*100,2)))
print("Applicants labelled 1 are defaulters.")
print("-"*100)
credit_card_balance_merged.head()
Merging TARGET with Credit card balance dataframe
Percentage of 0 labeled rows is 92.83 %
Percentage of 1 labeled rows is 7.17 %
Applicants labelled 1 are defaulters.
| | SK_ID_CURR | TARGET | SK_ID_PREV | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100003 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 100004 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 100006 | 0 | 1489396.0 | -2.0 | 0.0 | 270000.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | Active | 0.0 | 0.0 |
| 4 | 100006 | 0 | 1489396.0 | -1.0 | 0.0 | 270000.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | Active | 0.0 | 0.0 |
# Generating credit_card_balance categorical dataframe
credit_card_balance_merged_categorical_col=["TARGET","NAME_CONTRACT_STATUS"]
credit_card_balance_merged_cat=credit_card_balance_merged[credit_card_balance_merged_categorical_col]
# Phi-k correlation matrix between categorical variables and the target label
phik_overview = credit_card_balance_merged_cat.phik_matrix()
phik_overview.round(2)
credit_card_balance_top_categorical_var_lst=phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between categorical variable and Target variable sorted in descending order are :\n")
phik_overview.sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
interval columns not set, guessing: ['TARGET']
Correlation between categorical variable and Target variable sorted in descending order are :

TARGET                  1.000000
NAME_CONTRACT_STATUS    0.008896
Name: TARGET, dtype: float64
# Let's similarly observe the correlation between continuous variables and the Target variable using the Pearson correlation coefficient
credit_card_balance_continuous=credit_card_balance_merged.drop(["SK_ID_CURR","SK_ID_PREV","NAME_CONTRACT_STATUS"],axis=1)
# Computing the Pearson correlation coefficients
corr=credit_card_balance_continuous.corr()
# Plotting the correlation matrix
mask_array = np.ones(corr.shape)
mask_array = np.triu(mask_array)
f, ax = plt.subplots(figsize=(12, 12))
cmap = sns.color_palette("Blues")
sns.heatmap(corr,
mask=mask_array,
cmap=cmap,
vmax=.3,
center=0,
square=True,
linewidths=3,
cbar_kws={"shrink": .5}
)
plt.title("Correlation Pearson for Credit card balance continuous variables")
plt.show()
print("-"*100)
credit_card_balance_top_continuous_var_lst=corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11).index.tolist()
print("Correlation between continuous variable and Target variable sorted in descending order are :\n")
corr.abs().sort_values(["TARGET"], ascending=False)['TARGET'].head(11)
Correlation between continuous variable and Target variable sorted in descending order are :

TARGET                      1.000000
AMT_BALANCE                 0.050098
AMT_TOTAL_RECEIVABLE        0.049839
AMT_RECIVABLE               0.049803
AMT_RECEIVABLE_PRINCIPAL    0.049692
AMT_INST_MIN_REGULARITY     0.039798
CNT_DRAWINGS_ATM_CURRENT    0.038437
CNT_DRAWINGS_CURRENT        0.037793
MONTHS_BALANCE              0.035695
CNT_DRAWINGS_POS_CURRENT    0.029536
AMT_DRAWINGS_ATM_CURRENT    0.024700
Name: TARGET, dtype: float64
Observations and Conclusions:
# AMT_BALANCE
print_percentiles(credit_card_balance_merged,credit_card_balance_top_continuous_var_lst[1],percentiles=[0,10,25,50,75,90,100])
plot_continuous_variables(credit_card_balance_merged,credit_card_balance_top_continuous_var_lst[1],"Yes",(-250000,500000),(-250000,500000))
The 0th percentile value of AMT_BALANCE is -420250.185
The 10th percentile value of AMT_BALANCE is 0.0
The 25th percentile value of AMT_BALANCE is 0.0
The 50th percentile value of AMT_BALANCE is 0.0
The 75th percentile value of AMT_BALANCE is 90405.54
The 90th percentile value of AMT_BALANCE is 180226.14300000004
The 100th percentile value of AMT_BALANCE is 1354829.265
Continuous variable AMT_BALANCE has 220606 NaN rows.
Percentage of NaN rows for AMT_BALANCE continuous variable are 6.4 %
Percentage of 0 labels for NaN value row is 92.16 %
Percentage of 1 labels for NaN value row is 7.84 %
Applicants labelled 1 are defaulters.
Observations and Conclusions :
# AMT_TOTAL_RECEIVABLE
print_percentiles(credit_card_balance_merged,credit_card_balance_top_continuous_var_lst[2],percentiles=[0,10,25,50,75,90,100])
plot_continuous_variables(credit_card_balance_merged,credit_card_balance_top_continuous_var_lst[2],"Yes",(-250000,500000),(-250000,500000))
The 0th percentile value of AMT_TOTAL_RECEIVABLE is -420250.185
The 10th percentile value of AMT_TOTAL_RECEIVABLE is 0.0
The 25th percentile value of AMT_TOTAL_RECEIVABLE is 0.0
The 50th percentile value of AMT_TOTAL_RECEIVABLE is 0.0
The 75th percentile value of AMT_TOTAL_RECEIVABLE is 90226.35
The 90th percentile value of AMT_TOTAL_RECEIVABLE is 180051.597
The 100th percentile value of AMT_TOTAL_RECEIVABLE is 1354829.265
Continuous variable AMT_TOTAL_RECEIVABLE has 220606 NaN rows.
Percentage of NaN rows for AMT_TOTAL_RECEIVABLE continuous variable are 6.4 %
Percentage of 0 labels for NaN value row is 92.16 %
Percentage of 1 labels for NaN value row is 7.84 %
Applicants labelled 1 are defaulters.
Observations and Conclusions :
df=credit_card_balance_merged[["TARGET","AMT_BALANCE","AMT_TOTAL_RECEIVABLE"]]
pairPlotFn(df,"Credit Card Balance")
Observations and Conclusions:
Based on the EDA observations and the problem statement requirements, the following performance metrics are chosen to evaluate model performance:
Confusion matrix, Precision matrix and Recall matrix
The confusion matrix provides insight not only into the overall performance of a predictive model but also into which classes are being predicted correctly, which incorrectly, and what types of errors are being made. The precision matrix shows whether a predicted class belongs to the correct class, and the recall matrix shows whether a true class is being correctly predicted.
As the data is imbalanced, it is important to understand how defaulters and non-defaulters are classified into the correct classes. This is captured by the confusion, precision and recall matrices.
F1 Score
The F1 score is the harmonic mean of precision and recall. It is an important metric for evaluating model performance on imbalanced data.
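A minimal scikit-learn sketch of these metrics on a toy imbalanced label set (the labels and predictions below are illustrative only, not from the trained model):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# 1 = defaulter (minority class), mirroring the TARGET encoding used above
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # [[7 1]
                                         #  [1 1]]  rows: true class, cols: predicted class
print(precision_score(y_true, y_pred))   # 0.5 = TP / (TP + FP)
print(recall_score(y_true, y_pred))      # 0.5 = TP / (TP + FN)
print(f1_score(y_true, y_pred))          # 0.5 = harmonic mean of precision and recall
```

Note that accuracy here would be 0.8 despite the model catching only half the defaulters, which is exactly why these metrics are preferred over accuracy on imbalanced data.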
Area under the ROC curve
ROC AUC measures how effective an algorithm is at separating the classes. A ROC curve is a diagnostic plot that summarizes the behavior of a model by computing the false positive rate and true positive rate of its predictions under different thresholds. This metric is also useful for understanding model performance under class imbalance.
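The ranking interpretation of ROC AUC can be checked on a toy example; a minimal scikit-learn sketch (scores are made-up probabilities, not model output):

```python
from sklearn.metrics import roc_auc_score

# Scores are predicted probabilities of class 1 (defaulter)
y_true  = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

# AUC = probability that a random defaulter is ranked above a random non-defaulter:
# here 7 of the 2*4 = 8 (defaulter, non-defaulter) pairs are ordered correctly
print(roc_auc_score(y_true, y_score))  # 0.875
```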
The EDA gave good insight into the available data. Conclusions from the descriptive statistics, correlation coefficients and variable plots are as follows: